Feature/prefetch2#1604

Open
maddyscientist wants to merge 124 commits into develop from feature/prefetch2

Conversation

@maddyscientist
Member

This is the latest work towards optimizing QUDA for Blackwell:

  • Adds support for "spatial prefetching", where we over-fetch data to L2 when issuing a global load. Exposed as an optional template parameter to vector_load. At present, not deployed anywhere.
  • Adds support for prefetch instructions, in the form of both per-thread prefetching (which works on all CUDA architectures) and TMA-based prefetching, which is Hopper+ only. The prefetch type is set using the QUDA_DSLASH_PREFETCH CMake parameter, with 0=per-thread, 1=TMA bulk, and 2=TMA descriptor
  • Add an experimental L1 prefetch (using LDGSTS). Disabled, but left for future experiments.
  • Add single-threaded execution region helper function target::is_thread_zero() which should be used for TMA issuance.
  • Optionally store the backward-shifted gauge field. This simplifies all dslash indexing, as all spatial indices then correspond to "this" site. Enabled with QUDA_DSLASH_DOUBLE_STORE=ON, which is required for TMA-based prefetching (for alignment reasons).
  • Prefetching is exposed for both ColorSpinorFields and GaugeFields, though only the latter is actually used at present.
  • Added prefetching support to both Wilson and Staggered dslash kernels, parameterized using QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED CMake parameters.
  • Optimization of the neighbor indexing for the dslash kernels, reducing integer instruction overhead.
  • Reduction in pointer arithmetic overhead (use more 32-bit integer operations where possible). Added three-operand and four-operand variants of vector_load and vector_store to this end, respectively.
  • Optimization of FFMA2 issuance to reduce the total number of floating point instructions on Blackwell
  • Optimization of short <-> float conversion to reduce instruction overheads
  • Optimization of staggered packing kernels (replace division by int with division by fast_intdiv)
  • Extends OpenMP parallelization to host code where it was missing.

The end result of this work is that both the Staggered and Wilson dslash kernels can saturate over 90% of memory bandwidth for most variants. The outstanding exceptions are the half precision variants using reconstruction, which still lag; these will be the focus of a subsequent PR.

…tead of logic operations when computing the neighboring index; this is branch free and uses fewer operations
…for executing single-thread regions of code. On CUDA install the latest version of CCCL via CPM since we need some new features
…slash kernels. Disabled by default (set with the Arg::prefetch_distance parameter), and TMA prefetch will be added in the next push
…ith QUDA_DSLASH_PREFETCH_BULK=ON). Prefetch distance is now set via CMake (QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED)
…ants of vector_load and vector_store: these allow for the pointer offset and the index to be computed together first in 32-bit, before accumulation to the pointer in 64-bit, reducing pointer arithmetic overheads
…d and vector_store to reduce indexing overheads
TMA (Tensor Memory Accelerator) is only available on Hopper (sm_90+) and
later architectures. This commit wraps the cuTensorMapEncodeTiled calls
with a compile-time guard to prevent runtime errors on Volta/Ampere GPUs.
@havogt
Contributor

havogt commented Feb 6, 2026

cscs-ci run

@kostrzewa
Member

kostrzewa commented Mar 11, 2026

Is performance on AMD regularly benchmarked "officially"? If so, what is being benchmarked?

After recent updates on Lumi-G I had to update our production stack. I was not able to compile the head commit of the develop branch any more (related to what is observed in #1617 I think):

/users/bakostrz/code/quda-develop-e318708/include/targets/hip/../generic/shared_memory_cache_helper.h:127:7: error: no matching constructor for initialization of 'SharedMemory<atom_t<complex<int>[8][8][4][2][2]>, SizeDims<DimsStaticConditional<2, 1, 1>, sizeof(complex<int>[8][8][4][2][2]) / sizeof(atom_t<complex<int>[8][8][4][2][2]>)>, void>' (aka 'SharedMemory<HIP_vector_type<int, 4>, SizeDims<quda::DimsStaticConditional<2, 1, 1>, sizeof(quda::complex<int>[8][8][4][2][2]) / sizeof(atom_t<complex<int>[8][8][4][2][2]>)>, void>')
  127 |       Smem(ops, arg...), block(D::dims(target::block_dim(), arg...)), stride(block.x * block.y * block.z)

nor the (now very old) commit that we used on Lumi-G previously (6198d60):

In file included from /users/bakostrz/code/quda-develop-6198d6/lib/../include/float_vector.h:10:
/users/bakostrz/code/quda-develop-6198d6/lib/../include/complex_quda.h:425:13: error: no member named 'x' in 'complex<ValueType>'
  425 |       this->x *= z;
      |       ~~~~  ^
/users/bakostrz/code/quda-develop-6198d6/lib/../include/complex_quda.h:426:13: error: no member named 'y' in 'complex<ValueType>'
  426 |       this->y *= z; 

which I couldn't fix by trying to backport the changes to quda::complex.

We figured out that the feature/prefetch2 branch compiles, but I observe substantial performance regressions in our tmLQCD+QUDA HMC compared to our production setup which was running until December 2025:

  • about 30% in MG solves as used in our HMC
  • a factor > 2 in updateMultigridQuda
  • a factor > 2.5 in double-half mixed-precision CG (strangely not always)
  • a factor > 2 in single precision multi-shift solves with double-half refinement

Overall this leads to a factor > 2 increase in time per trajectory unfortunately.

I'm unable to pin down what is responsible as we had to update from rocm-5.6.1 (very old, I know, but that was what was available on Lumi-G at the time) to rocm-6.3.4 or rocm-6.4.4 AND make a very large jump in QUDA version.

@maddyscientist
Member Author

@kostrzewa thanks for the report on where things stand on ROCm. I think the compilation issue should be fixed with 9b83fde, which @weinbe2 has just cherry-picked into his PR that will be merged shortly, so that should get develop working again. We should also make sure that the ROCm CI captures this failure, which I guess means an update to the CI might be needed (@dmcdougall)?

Regarding the performance regression, do you happen to have a tune cache to hand for before and after? That would help guide us as to where the regression is. I suspect the issue is a compiler driven regression in the dslash performance, but it could also be changes in QUDA itself.

Since that old version of QUDA, one of the biggest changes is that the default data ordering has changed, to what I call "maximal vectorization". What this means is, for example, where we previously would have used 9x float2 SoA ordering for an 18-real-value gauge field, we now use 4x float4 + 1x float2 ordering on Hopper, or 2x float8 + 1x float2 ordering on Blackwell. The motivation here is to reduce overall instruction count (indexing and load instructions). I did incorporate the ability to use the legacy ordering, though, in case of regressions. To enable this, set -DQUDA_ORDER_DOUBLE=0 -DQUDA_ORDER_SINGLE=0 -DQUDA_ORDER_HALF=0 and recompile.
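For reference, a rebuild with the legacy ordering might look like this (the source and build paths are placeholders):

```shell
# Reconfigure QUDA with the legacy data ordering (paths are placeholders)
cmake -S /path/to/quda -B build \
  -DQUDA_ORDER_DOUBLE=0 \
  -DQUDA_ORDER_SINGLE=0 \
  -DQUDA_ORDER_HALF=0
cmake --build build -j
```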

…with shifting (can't shift a shifted field), and fix move constructor so that shift field is moved
@kostrzewa
Member

kostrzewa commented Mar 12, 2026

@maddyscientist

Regarding the performance regression, do you happen to have a tune cache to hand for before and after? That would help guide us as to where the regression is. I suspect the issue is a compiler driven regression in the dslash performance, but it could also be changes in QUDA itself.

Ah, I always forget to look at the tunecaches. Yes, please find them attached here:

quda_amd_perf_regression.tar.gz

the directory names in the archive should be reasonably self-explanatory.

Looking at some of the kernels in profile_async_0.tsv seems to confirm my observations from the tmLQCD-internal timers w.r.t. the MG as well as the ndeg twisted clover half precision kernels:

new:      744.781         7.81725         32.2379          242426       0.0030722       56x28x28x16x2        N4quda31NdegTwistedCloverPreconditionedINS_34NdegTwistedCloverPreconditionedArgIsLi3ELi4ENS_4DDNoEL21QudaReconstructType_s18EEEEE       policy,GPU-offline,kernel_arg_threshold=4096,vol=1404928,parity=1,precision=2,Ns=4,Nc=3,order=0,N=8,alt_i2f=0,TwistFlavor=2,commDim=0111,dagger,n_rhs=1,n_rhs_tile=1,topo=14414,order=01234567,p2p=0,gdr=1,nvshmem=0,pol=111111000011000000       # 878.02 Gflop/s, 391.45 GB/s, tuning took 0.235447 seconds at Thu Mar  5 17:43:55 2026

old:      277.445         5.18452          51.208          229761      0.00120754       56x28x28x16x2        N4quda31NdegTwistedCloverPreconditionedINS_34NdegTwistedCloverPreconditionedArgIsLi3ELi4EL21QudaReconstructType_s18EEEEE        policy,GPU-offline,vol=1404928,parity=1,precision=2,order=8,Ns=4,Nc=3,TwistFlavor=2,commDim=0111,dagger,topo=14414,order=01234567,p2p=0,gdr=1,nvshmem=0,pol=111111111111110000    # 2233.85 Gflop/s, 995.93 GB/s, tuning took 0.198384 seconds at Mon Oct 27 19:20:54 2025

I did incorporate the ability to use the legacy ordering though in case of regressions. To enable this, ...

I'll try this right away, thanks!


Going back to the legacy order helps a little.

The situation is subtle because on a 32c64 lattice on 2 nodes (16 GCDs), with the prefetch2 branch and legacy order I actually see a slight overall performance improvement with CrayEnv_gnu_rocm_644 over the old commit with gnu_env_23_09_rocm_561.

On 28 nodes on a 112c224 lattice instead I see, as an example:

rocm-561 / 6198d60:

MultiShiftCG: Converged after 1113 iterations
MultiShiftCG:  shift=0, 1113 iterations, relative residual: iterated = 7.732915e-07
MultiShiftCG:  shift=1, 1113 iterations, relative residual: iterated = 1.088741e-09
MultiShiftCG:  shift=2, 604 iterations, relative residual: iterated = 1.064384e-09
# QUDA: Refining shift 0: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 1230 iterations, L2 relative residual: iterated = 9.987934e-12, true = 9.987934e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 1: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 388 iterations, L2 relative residual: iterated = 9.899012e-12, true = 9.899012e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 2: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 232 iterations, L2 relative residual: iterated = 9.888607e-12, true = 9.888607e-12 (requested = 1.000000e-11)
# TM_QUDA: Time for invertMultiShiftQuda 2.549249e+01 s level: 5 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/invertMultiShiftQuda
[...]
# TM_QUDA: QpQm solve done: 2963 iter / 25.4923 secs = 304594 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 2.636584e+01 s level: 4 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift

rocm-644 / prefetch2 3c8ed1a / defaults

MultiShiftCG: Converged after 1113 iterations
MultiShiftCG:  shift=0, 1113 iterations, relative residual: iterated = 7.704380e-07
MultiShiftCG:  shift=1, 1113 iterations, relative residual: iterated = 1.084594e-09
MultiShiftCG:  shift=2, 604 iterations, relative residual: iterated = 1.061401e-09
# QUDA: Refining shift 0: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 1229 iterations, L2 relative residual: iterated = 9.912935e-12, true = 9.912935e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 1: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 619 iterations, L2 relative residual: iterated = 9.893005e-12, true = 9.893005e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 2: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 368 iterations, L2 relative residual: iterated = 9.933633e-12, true = 9.933633e-12 (requested = 1.000000e-11)
# TM_QUDA: Time for invertMultiShiftQuda 4.652547e+01 s level: 5 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/invertMultiShiftQuda
[...]
# TM_QUDA: QpQm solve done: 3329 iter / 46.5254 secs = 187391 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 4.740475e+01 s level: 4 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift

rocm-644 / prefetch2 3c8ed1a / legacy order

MultiShiftCG: Converged after 1115 iterations
MultiShiftCG:  shift=0, 1115 iterations, relative residual: iterated = 7.645503e-07
MultiShiftCG:  shift=1, 1115 iterations, relative residual: iterated = 1.075038e-09
MultiShiftCG:  shift=2, 604 iterations, relative residual: iterated = 1.065411e-09
# QUDA: Refining shift 0: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 1231 iterations, L2 relative residual: iterated = 9.962922e-12, true = 9.962922e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 1: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 540 iterations, L2 relative residual: iterated = 9.884733e-12, true = 9.884733e-12 (requested = 1.000000e-11)
# QUDA: Refining shift 2: L2 residual inf / 1.000000e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 262 iterations, L2 relative residual: iterated = 9.888722e-12, true = 9.888722e-12 (requested = 1.000000e-11)
# TM_QUDA: Time for invertMultiShiftQuda 3.699154e+01 s level: 5 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/invertMultiShiftQuda
[...]
# TM_QUDA: QpQm solve done: 3148 iter / 36.9926 secs = 222860 Gflops
# TM_QUDA: Time for invert_eo_quda_twoflavour_mshift 3.787983e+01 s level: 4 proc_id: 0 /HMC/ndcloverrat4:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift

Note that these inversions have identical starting conditions. I guess it's the autotuning which causes the iteration numbers to differ a little. The main point is the time per iteration though / the reported performance.

Sorry for polluting the discussion here with so much stuff. I guess I should have opened a new issue for this...
